Code
lex_raw <- read.csv("../stat331-FinalProject/lex.csv")
gdp_pcap_raw <- read.csv("../stat331-FinalProject/gdp_pcap.csv")
continents_raw <- read.csv("../stat331-FinalProject/country_by_continents.csv")All code, raw data, and project files are available in our GitHub Repository. Feel free to explore or replicate our analysis!
This analysis utilizes the life expectancy and the gross domestic product (GDP) datasets sourced Gapminder, a non-profit organization whose mission “is to fight devastating ignorance with a fact-based world view everyone could understand.” Their site provides data sets collected from many reputable sources and interactive visualizations on important world topics. Additionally, we used a countries by continent dataset from Daina Bouquin’s dataset posted on Kaggle. The continents dataset provides continent classifications for countries, allowing us conduct continent-specific analysis in our study.
We were interested in these data sets, as previous research has shown that GDP and life expectancy are related. A higher GDP per capita allows for better access to essential resources and services that improve health and longevity, such as better diet, healthcare, and sanitation. Additionally, GDP serves as an indicator of economic development. Countries with higher GDPs often have more advanced infrastructure and social services, which further extend life expectancy
lex_raw <- read.csv("../stat331-FinalProject/lex.csv")
gdp_pcap_raw <- read.csv("../stat331-FinalProject/gdp_pcap.csv")
continents_raw <- read.csv("../stat331-FinalProject/country_by_continents.csv")In the raw GDP dataset, some values included a “k” suffix to represent thousands of dollars (e.g., 10,000 to 10k). The first step is to figure out a way to convert GDP values into numeric form. To keep values constant, we created a function that converts these abbreviated values into their full numeric form, allowing for accurate numeric comparisons. Without this step, any observations containing a “k” would be dropped, leaving it empty and could potentially affecting later analysis.
clean_year <- function(str) {
str <- case_when(
str_detect(str, "k$") ~
as.numeric(str_replace_all(str, "k$", "")) * 1000,
.default = as.numeric(str)
)
return(str)
}The life expectancy data contains information about the life expectancy for 196 countries from the year 1800 to 2100. It provides the life expectancy in years for each country within the set. For the period from 1800 to 1970, the data was sourced from Gapminder’s main source v7: by Mattias Lindgren who assessed impacts from the biggest disasters in history in order to make rough estimates of life expectancy. Data for 1950-2019 was from the Global Burden of Disease Study 2019, which has 1950-2019 from the IHME. For 2020-2100, Gapminder used UN forecasts from the World Population Prospects 2022.
Life Expectancy Info from: https://www.gapminder.org/data/documentation/gd004
# pivot year columns to a single year and life expectancy column
lex <- lex_raw |>
pivot_longer(
cols = X1800:X2100,
names_to = "year",
values_to = "life_expectancy"
) |>
mutate(
year = str_sub(year, start = 2),
year = as.factor(year)
)The GDP data was obtained from the Madison Project Database (MPD) and Penn World Table (PWT). This data set contains information on gross domestic product (GDP) per person adjusted for differences in purchasing power in international dollars, and fixed 2017 prices. GDP per capita measures the value of everything a country produces during a year, divided by the number of people. We transformed the data to have columns containing the country, year, and GDP of interest.
GDP Info from: https://www.gapminder.org/data/documentation/gd001/
# pivot year columns to a single year and gdp column
gdp_pcap <- gdp_pcap_raw |>
mutate(across(
.cols = X1800:X2100,
.fns = ~ clean_year(.x)
)) |>
pivot_longer(
cols = X1800:X2100,
names_to = "year",
values_to = "gdp"
) |>
mutate(
year = str_sub(year, start = 2),
year = as.factor(year),
gdp = as.numeric(gdp)
)We transformed each of the individual year columns into one singular column so that the dataset would be easier to read. As a result, each observation consists of one country and year, with the corresponding life expectancy. The raw GDP data is similar to the life expectancy data in that each year has its own column. So we transformed the data in a similar way, making year its own column with its corresponding GDP.
# join both tables by country and year
gdp_lex <- lex |>
full_join(
gdp_pcap,
join_by(
country == country,
year == year
)
) |>
na.omit()
#creating a data frame to add continent
country_mapping <- tibble(
continent_name = c(
"Burkina", "Burma (Myanmar)", "Congo", "Czechia",
"Democratic Republic of Congo", "East Timor", "Hong Kong",
"Ivory Coast", "Kyrgyzstan", "Laos", "Macedonia", "Micronesia",
"Saint Kitts and Nevis", "Saint Lucia",
"Saint Vincent and the Grenadines", "Slovakia", "Swaziland",
"United Arab Emirates", "United Kingdom", "United States", "Palestine"
),
gdp_name = c(
"Burkina Faso", "Myanmar", "Congo, Rep.", "Czech Republic",
"Congo, Dem. Rep.", "Timor-Leste", "Hong Kong, China",
"Cote d'Ivoire", "Kyrgyz Republic", "Lao",
"North Macedonia", "Micronesia, Fed. Sts.", "St. Kitts and Nevis",
"St. Lucia", "St. Vincent and the Grenadines",
"Slovak Republic", "Eswatini", "UAE", "UK", "USA", "Palestine"
)
)
#altering continent to include Palestine
continents <- continents_raw |>
add_row(Continent = "Asia", Country = "Palestine") |>
left_join(country_mapping, by = c("Country" = "continent_name")) |>
mutate(combined = if_else(!is.na(gdp_name), gdp_name, Country))
# Join both datasets to assign continent value to all countries
gdp_lex <- left_join(gdp_lex, continents,
by = c("country" = "combined")
) |>
select(-country, -gdp_name) |>
mutate(Continent = as.factor(Continent))After cleaning up each data set, we had to join the two together by our observational unit, country. We hypothesize that as GDP increases, life expectancy will also begin to increase, as a higher GDP correlates to better infrastructure and more/better access to healthcare and medicine.
# Define color palette for continents
continent_colors <- c(
"Africa" = "#E41A1C",
"Asia" = "#377EB8",
"Europe" = "#4DAF4A",
"North America" = "#984EA3",
"Oceania" = "#FF7F00",
"South America" = "#FFFF33"
)
ggplot(gdp_lex, aes(x = gdp, y = life_expectancy, color = Continent)) +
geom_smooth(method = "loess", se = TRUE, alpha = 0.2) +
scale_x_log10() +
scale_color_manual(values = continent_colors) +
scale_alpha(range = c(0.2, 0.9), guide = "none") +
theme_minimal() +
labs(
title = "GDP per Capita vs Life Expectancy",
subtitle = "Non-linear relationship with confidence intervals by continent",
x = "GDP per Capita (log scale)",
y = "Life Expectancy (years)"
) +
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 24),
axis.title = element_text(size = 18),
plot.subtitle = element_text(size = 18),
plot.caption = element_text(size = 14),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
legend.title = element_text(size = 16),
legend.text = element_text(size = 14)
)gdp_lex <- gdp_lex |>
mutate(year = as.integer(year)) |>
mutate(year = 1800 + (year - min(year))) |>
drop_na()
ggplot(gdp_lex, aes(gdp, life_expectancy, color = Continent)) +
geom_point(alpha = 0.7, size = 2.5, show.legend = FALSE) +
scale_color_manual(values = continent_colors) +
scale_size(range = c(2, 12), guide = "none") +
scale_x_log10() +
facet_wrap(~Continent) +
theme_minimal() +
theme(
strip.text = element_text(face = "bold", size = 18),
plot.title = element_text(face = "bold", size = 24),
axis.title = element_text(size = 18),
plot.subtitle = element_text(size = 18),
plot.caption = element_text(size = 14),
legend.position = "bottom",
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14)
) +
labs(
title = "GDP per Capita vs Life Expectancy by Continent",
subtitle = "Year: {frame_time}",
x = "GDP per Capita (log scale)",
y = "Life Expectancy (years)",
caption = "Data source: Gapminder",
color = "Continent"
) +
transition_time(as.integer(year)) +
ease_aes("linear")gdp_lex_mean <- gdp_lex |>
group_by(Country) |>
summarize(
avg_gdp = mean(gdp, na.rm = TRUE),
avg_life_expectancy = mean(life_expectancy, na.rm = TRUE)
)
gdp_lex_lm <- lm(avg_life_expectancy ~ avg_gdp, data = gdp_lex_mean)To find the linear regression models for each continent, the average GDP and life expectancy was calculated for each country across all years. This way, each country has just one value for GDP (response variable) and one for life expectancy (explanatory variable).
models <- gdp_lex |>
group_by(Country) |>
mutate(
avg_gdp = mean(gdp, na.rm = TRUE),
avg_lex = mean(life_expectancy, na.rm = TRUE)
) |>
distinct(Country, .keep_all = TRUE) |>
select(-year) |>
ungroup() |>
group_by(Continent) |>
nest() |>
mutate(
model = map(data, ~ lm(avg_lex ~ avg_gdp, data = .x)),
summary = map(model, tidy)
) |>
select(Continent, summary) |>
unnest(summary)
models |>
gt(groupname_col = "Continent") |>
tab_header(
title = md("**Regression Model Estimates by Continent**"),
subtitle = "Average Life Expectancy vs Average GDP"
) |>
fmt_number(
columns = c(estimate, std.error, statistic, p.value),
decimals = 6
) |>
cols_label(
term = "Continent",
estimate = "Estimate",
std.error = "Std. Error",
statistic = "t-Statistic",
p.value = "p-Value"
) |>
tab_spanner(
label = "Model Estimates",
columns = c(estimate, std.error, statistic, p.value)
) |>
data_color(
columns = p.value,
colors = col_numeric(
palette = c("green", "yellow", "red"),
domain = c(0, 0.05, 1)
)
) |>
tab_style(
style = list(
cell_fill(color = "lightgray"),
cell_text(weight = "bold")
),
locations = cells_row_groups()
) |>
tab_footnote(
footnote = "P-values below 0.05 indicate statistical significance.",
locations = cells_column_labels(columns = p.value)
)| Regression Model Estimates by Continent | ||||
|---|---|---|---|---|
| Average Life Expectancy vs Average GDP | ||||
| Continent |
Model Estimates
|
|||
| Estimate | Std. Error | t-Statistic | p-Value1 | |
| Asia | ||||
| (Intercept) | 49.612363 | 0.967254 | 51.291992 | 0.000000 |
| avg_gdp | 0.000151 | 0.000054 | 2.826570 | 0.007052 |
| Africa | ||||
| (Intercept) | 45.545427 | 0.468580 | 97.198742 | 0.000000 |
| avg_gdp | 0.000608 | 0.000076 | 7.955113 | 0.000000 |
| Europe | ||||
| (Intercept) | 53.359873 | 0.980483 | 54.422023 | 0.000000 |
| avg_gdp | 0.000296 | 0.000031 | 9.637500 | 0.000000 |
| South America | ||||
| (Intercept) | 55.974115 | 1.968199 | 28.439260 | 0.000000 |
| avg_gdp | −0.000080 | 0.000135 | −0.594425 | 0.565433 |
| North America | ||||
| (Intercept) | 48.092682 | 1.902205 | 25.282599 | 0.000000 |
| avg_gdp | 0.000591 | 0.000114 | 5.162279 | 0.000041 |
| Oceania | ||||
| (Intercept) | 47.159429 | 2.647934 | 17.809898 | 0.000000 |
| avg_gdp | 0.000724 | 0.000177 | 4.099796 | 0.001473 |
| 1 P-values below 0.05 indicate statistical significance. | ||||
\[ \hat{y} = 49.6 + 0.000151x \]
Intercept: When the average GDP of Asia is $0, the average life expectancy in Asia is 49.6 years.
Slope: For each additional $1 increase in average GDP in Asia, the life expectancy of a person in Asia will increase by 0.000151 years.
\[ \hat{y} = 45.5 + 0.000608x \]
Intercept: When the average GDP of Africa is \(0\) dollars, the average life expectancy in Africa is \(45.5\) years.
Slope: For each additional \(1\) dollar increase in average GDP in Africa, the life expectancy of a person in Africa will increase by \(0.000608\) years.
\[ \hat{y} = 53.4 + 0.000296x \]
Intercept: When the average GDP of Europe is \(0\) dollars, the average life expectancy in Europe is \(53.4\) years.
Slope: For each additional \(1\) dollar increase in average GDP in Europe, the life expectancy of a person in Europe will increase by \(0.000296\) years.
\[ \hat{y} = 56 - 0.0000803x \]
Intercept: When the average GDP of South America is \(0\) dollars, the average life expectancy in South America is \(56\) years.
Slope: For each additional \(1\) dollar increase in average GDP in South America, the life expectancy of a person in South America will decrease by \(0.0000803\) years.
\[ \hat{y} = 48.1 + 0.000591x \]
Intercept: When the average GDP of North America is \(0\) dollars, the average life expectancy in North America is \(48.1\) years.
Slope: For each additional \(1\) dollar increase in average GDP in North America, the life expectancy of a person in North America will increase by \(0.000591\) years.
\[ \hat{y} = 47.2 + 0.000724x \]
Intercept: When the average GDP of Oceania is \(0\) dollars, the average life expectancy in Oceania is \(47.2\) years.
Slope: For each additional \(1\) dollar increase in average GDP in Oceania, the life expectancy of a person in Oceania will increase by \(0.000724\) years.
# The variance in the response values.
res_var <- var(augment(gdp_lex_lm)$avg_life_expectancy)
# variance in the fitted values from your regression model
fitted_var <- var(augment(gdp_lex_lm)$.fitted)
# The variance in the residuals from your regression model.
resid_var <- var(augment(gdp_lex_lm)$.resid)
# Calculate R-squared and proportion of variability explained
r_squared <- fitted_var / res_var
#create table
tibble("Response" = round(res_var, 2),
"Fitted Values" = round(fitted_var, 2),
"Residuals" = round(resid_var, 2),
"R²" = round(r_squared, 4)) |>
kable(caption = "Variances and Model Fit")| Response | Fitted Values | Residuals | R² |
|---|---|---|---|
| 51.99 | 29.45 | 22.54 | 0.5665 |
Our table demonstrates that the model provides a moderate fit to the data. Our proportion of variance, or R² value of 0.5665, indicates about 56.65% of the total variation in life expectancy across countries is explained by our GDP based model. This suggests that GDP plays a significant role in determining life expectancy, but is not the sole factor. The remaining 43.35% of the variation must be explained by other factors, such as healthcare spending, education levels, and environmental factors.